Abstract 

Bayesian sequential testing of multiple simple hypotheses is a classical sequential decision problem. However, 
the optimal policy is computationally intractable in general, as the posterior probability space is exponentially 
increasing in the number of hypotheses (the curse of dimensionality in state space). We consider a specialized 
problem in which observations are drawn from the same exponential family. By reconstructing the posterior 
probability vector from a low-dimensional diagnostic sufficient statistic, it is shown that the intrinsic dimension of 
the reachable posterior probability space is determined by the rank of a diagnostic matrix, which cannot exceed 
the number of parameters governing the exponential family, or the number of hypotheses, whichever is smaller. 
For univariate exponential families commonly used in practice, the probability space has one or two dimensions in 
most cases. Hence, the optimal policy can be computed efficiently. Geometric interpretation and illustrative 
examples are presented. Simulation studies suggest that the optimal policy can substantially outperform the existing 
method. The results are also extended to the sequential sampling control problem. 

Index Terms 

Curse of dimensionality; Dynamic programming; Exponential family; Partially observable Markov decision 
processes (POMDP); Sequential multi-hypothesis testing; Sampling control 


1. Introduction 


SEQUENTIAL multi-hypothesis testing is a generalization of standard statistical hypothesis testing to 
account for sequential observations and multiple alternative hypotheses. After obtaining new observations, 
the decision maker can stop and accept one of multiple hypotheses about the underlying statistical 
distribution, or wait for more observations in the hope of improving the accuracy of future decisions. The 
goal is to identify the true hypothesis as quickly as possible and with a desired accuracy, which can often 
be translated to minimizing the expected cost incurred by accepting an incorrect hypothesis and making 
more observations. 

This problem is a classical sequential decision-making problem. It involves a trade-off between the 
identification accuracy and time delay, which arises in a vast array of applications including medical 
diagnostics [?], supervised machine learning [?], network security [?], as well as educational testing, 
physiological monitoring, clinical trials and military target recognition; see [?] for a comprehensive 
discussion. 

The study of sequential hypothesis testing originated with Wald [?], who proposed a Bayes-optimal 
procedure for binary simple hypotheses called the sequential probability ratio test (SPRT). In SPRT, the 
decision maker observes independent and identically distributed (iid.) samples of a statistical distribution, 
one at a time, and calculates the probability ratio reflecting how likely one distribution (hypothesis) is 
to be true when compared to the other. If this ratio strongly favors one hypothesis, then she should stop and 
accept that hypothesis. Otherwise, she can keep observing. 
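
The stopping logic of the SPRT can be sketched in a few lines. The example below is a minimal illustration for two normal hypotheses with a common known variance. The function name `sprt_normal`, the error-rate parameters `alpha` and `beta`, and the use of Wald's classical threshold approximations are assumptions of this sketch; they are not the Bayes-optimal thresholds discussed in this paper.

```python
import math

def sprt_normal(observations, mu0, mu1, sigma2, alpha=0.05, beta=0.05):
    """Illustrative SPRT for H0: N(mu0, sigma2) versus H1: N(mu1, sigma2).

    Returns (decision, n_used); decision is "H0", "H1", or None when the
    data run out before either threshold is crossed.
    """
    # Wald's approximate thresholds on the log-likelihood ratio (assumed here).
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this value
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this value
    llr = 0.0                                # cumulative log-likelihood ratio
    for n, y in enumerate(observations, start=1):
        # log f1(y) - log f0(y) for two normals with the same variance
        llr += ((mu1 - mu0) * y - 0.5 * (mu1 ** 2 - mu0 ** 2)) / sigma2
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return None, len(observations)

print(sprt_normal([50.0, 52.0, 58.0], mu0=45.0, mu1=55.0, sigma2=25.0))
```

With only two hypotheses the state is a single scalar (the cumulative log-likelihood ratio), which is why the binary case is easy to implement.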

The generalization of SPRT to multiple simple hypotheses has also been considered by [3], but the 
Bayes-optimal policy is found to be extremely difficult to implement in practice, even for only three 
hypotheses. The structure of the optimal policy, on the other hand, is well understood [?]. Numerous 
heuristics (e.g., [?], [?], [?], [?]) and asymptotically optimal procedures (e.g., [?], [13], [?]) have 
been studied, except the optimal policy, as noted by a recent review [?]. 
One can view the sequential multi-hypothesis testing problem as a special type of partially observable 
Markov decision process problem (POMDP) with identity transition matrix [?]. The main difficulty in 
generalizing the optimal policy stems from the curse of dimensionality in dynamic programming. In the 
presence of two hypotheses, it suffices to consider the posterior probability of just one hypothesis, which 
is a scalar. But in the face of more than two hypotheses, one must consider a posterior probability vector 
(also called the belief vector). The size of the posterior probability space (or belief space) increases 
exponentially in the number of hypotheses, making the problem notoriously difficult to solve [?]. 

In this paper, we first consider a specialization in which the distributions in all hypotheses come from 
the same exponential family. Exponential families play a central role in statistical theory and include many 
parametric distributions (e.g., normal, binomial, Poisson) commonly used in practice. We find that the 
integration of sequential hypothesis testing and exponential families gives rise to a unique property, which 
allows us to reconstruct the N-dimensional posterior probability vector from an r-dimensional vector, 
which we shall call the diagnostic sufficient statistic, or DSS. Here, N is the number of hypotheses and r 
is what we call the diagnostic intrinsic dimension, which is the rank of the diagnostic matrix H, an N × M 
matrix that includes all natural parameters of the hypotheses, where M is the dimension of the exponential 
family. Clearly, r = rank(H) ≤ min{N, M}. That is, the diagnostic intrinsic dimension is bounded by the 
number of hypotheses and the dimension of the exponential family. For many univariate exponential-family 
distributions commonly used in practice, such as normal, binomial, and Poisson (see Table I), we have M ≤ 2. 
This means that, for these distributions, the dynamic programming problem can be reformulated in a 
space with at most two dimensions, and therefore the optimal policy can be found efficiently through dynamic 
programming, even when the number of hypotheses N is large. 

It is important to note that the proposed optimal solution method is not restricted to low-dimensional 
exponential families. Since r is the rank of an N × M matrix, it can be much smaller than both N 
and M. This happens when, for example, multicollinearity is present in the natural parameters associated 
with different hypotheses, which gives rise to a low-rank diagnostic matrix. In this situation, the optimal 
solution method can be applied to high-dimensional exponential families as well. 

A. Relevant Literature 

Sequential multi-hypothesis testing is a classical problem in statistical decision theory, so the 
relevant literature is substantial. We can only provide a sketch in this paper, and refer readers to [?] for a 
comprehensive review. In general, there are two streams of literature: Bayes-optimal policy and suboptimal 
policies. The stream of optimal policy has been focusing on the geometric structure of acceptance regions 
in the belief space. The stream of suboptimal policies focuses on practical solution procedures, which can 
be further divided into heuristic policies and asymptotically optimal policies. Note that “multi-hypotheses 
testing” in this paper refers to the “identification” problem and should not be confused with the multiple 
testing problem or multi-armed bandit problem. 

Optimal policy: The Bayes-optimal policy was first examined by [?], who formulated the problem in 
the belief space and showed that the optimal acceptance region for each hypothesis is convex and contains 
a vertex of the probability simplex. This structure can also be represented in terms of a conditional control 
limit policy [?]. Sequential multi-hypothesis testing has also been integrated with change detection in [18], 
who also characterized the geometric properties of the acceptance regions in the belief space. However, 
the intrinsic complexity renders the optimal implementation impractical. In this paper, we focus on the 
practical solution method instead of structural results. 

Suboptimal policy: Many heuristic approaches are based on the parallel implementation of multiple 
pairwise SPRTs. To test three hypotheses concerning the mean of a normal distribution, [?] constructed 
two SPRTs for two different pairs of hypotheses and specified a series of heuristic decision rules. [?] 
extended this procedure to a general number of hypotheses. A different modification of the acceptance 
regions was given by [?]. Representative procedures along this line are compared by [20]. However, 
these methods have been developed without much consideration of optimality. 
An intuitively appealing approach called the M-ary sequential probability ratio test (MSPRT) is proposed 
in [?]. It is a decoupled likelihood ratio test and has been shown to be asymptotically optimal when the 
observation costs approach zero or when the probabilities of incorrect selection approach zero [13], [21]. 
These limiting situations would arise where one can afford to obtain a substantial amount of information 
before making the final selection, or when the alternative hypotheses are easily distinguishable from each 
other. Asymptotically optimal solutions are also the foundation for recent developments of sequential joint 
detection and identification [?], decentralized sensing [23], as well as sampling control [24], [25], [?]. 
Note that dynamic programming is generally not involved in suboptimal policies, whereas it is almost 
inevitable in the search for the optimal policy. 

Sequential testing with exponential family: Many heuristic policies have been developed for the normal 
distribution [?], [8], [?], yet none claims optimality. As mentioned earlier, the Sobel-Wald procedure, 
as well as its extensions, are based on multiple SPRTs that are operated simultaneously, in which one 
must specify some coordination rules to manage potential conflicts among these parallel tests. One 
approach that does not involve multiple SPRTs is proposed by [?] for the testing of three hypotheses 
about the normal mean. However, it prohibits accepting any hypothesis at the early stage. This contradicts 
the optimal policy found in Section III-C1 of this paper. Detailed reviews of heuristic procedures targeting 
specific exponential families can be found in [?]. To the best of our knowledge, no optimal solution method that 
is scalable for multiple hypotheses about exponential families can be found in the literature. 


B. Contributions and limitations 

The main contribution of this paper is to show that a practical optimal solution method, in which 
computational complexity does not grow in response to the number of hypotheses, is possible in many 
practice-relevant cases. This method is based on reconstructing the high-dimensional belief vector using 
a low-dimensional diagnostic sufficient statistic. 

An advantage of the proposed method is that it is compatible with any discrete prior distribution, 
offering great flexibility to application. However, in contrast to the existing approach [6], where the acceptance 
thresholds are fixed and independent of the prior, our method generates a set of acceptance regions that are 
non-stationary and prior-dependent, which may add some complexity to the implementation. Nevertheless, 
numerical experiments suggest that the optimal policy can substantially outperform the popular suboptimal 
method when the hypotheses are difficult to differentiate and the delay penalties are high. 

This paper is organized as follows. Section II describes the problem, its standard formulation and 
solution procedure. Section III describes the belief-vector reconstruction and the reformulation of the 
optimality equation, illustrated with applications to open problems. Section IV compares the performance of 
the optimal solution with the existing suboptimal procedure. Section V extends the results to sampling 
control problems. The summary and discussions are given in Section VI. 


II. Preliminaries 

Consider a sequence of iid. observations $\{Y_1, Y_2, \ldots\}$, continuous or discrete, with probability density 
(or mass) function $f$ defined on $\mathcal{Y} \subset \mathbb{R}^D$. This distribution is unknown, but there are a finite number 
of distinct hypotheses about it, more specifically, $H_i : f = f_i$, $i = 0, \ldots, N$, where $\{f_0, f_1, \ldots, f_N\}$ are 
known distributions and one of them is equal to $f$. At time $k$, after observing the sequence $\{Y_1, \ldots, Y_k\}$, 
we must choose an action among the following alternatives: stop and accept hypothesis $H_i$, where 
$i \in \mathcal{N} = \{0, \ldots, N\}$, or wait until the next period and make a new observation, $Y_{k+1}$. The decision process 
is terminated if we choose to stop. In Section V we will consider a more general problem with multiple 
sampling modes. 

We hope to identify the true distribution with a desirable accuracy as quickly as possible. A sequential 
policy $\delta = (\tau, d)$ contains a stopping time $\tau$ with respect to the historical observations, and an acceptance 
decision rule $d$ taking values in the set $\mathcal{N}$. The decision process is terminated at time $\tau$ when we stop 
observing and, if $d = i$, we accept hypothesis $H_i$. We let $\Delta$ denote the set of admissible policies in 
which the stopping and acceptance decisions are based on the information available at time $\tau$. Suppose 
that hypothesis $H_i$ is true (namely, the actual distribution is $f = f_i$). If we stop and accept hypothesis 
$H_j$, then a termination cost $a_{ij} \ge 0$ is incurred, where $a_{ij} = 0$ if $i = j$ (no penalty for a correct 
identification) and $a_{ij} > 0$ if $i \ne j$ (penalty for misidentification). If we wait, then an observation 
cost $c_i > 0$ is incurred per period. Before any observation is obtained, some prior belief about the 
true hypothesis is available. Let $0 < \theta_i < 1$ denote the prior probability that hypothesis $i$ is true. 
Clearly, $\sum_{i=0}^{N} \theta_i = 1$. Let $\theta = (\theta_0, \theta_1, \ldots, \theta_N)$ be the prior belief vector in an $N$-dimensional belief 
space $S^N = \{\Pi = (\pi_0, \pi_1, \ldots, \pi_N) \in [0,1]^{N+1} \mid \pi_0 + \pi_1 + \cdots + \pi_N = 1\}$. Our objective is to find the 
Bayes-optimal policy, given the prior belief, that minimizes the total expected cost over an infinite horizon 

$$R^* = \inf_{\delta \in \Delta} \mathbb{E}\left[ \sum_{i=0}^{N} \tau c_i I_{\{f = f_i\}} + I_{\{\tau < \infty\}} \sum_{i=0}^{N} \sum_{j=0}^{N} a_{ij} I_{\{f = f_i,\ d = j\}} \right],$$

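where $R^*$ is also referred to as the minimum Bayes risk. Given the prior belief $\theta$, let 
$\Pi^k = (\pi_0^k, \pi_1^k, \ldots, \pi_N^k) \in S^N$ be the belief vector at period $k$, where $\pi_j^k$ is the posterior 
probability that $f = f_j$. Clearly, $\sum_{j=0}^{N} \pi_j^k = 1$. By Bayes' rule, we have 

$$\Pi^{k+1} = B(\Pi^k, Y_{k+1}) = \frac{\Pi^k g(Y_{k+1})}{\Pi^k \mathbf{f}(Y_{k+1})},$$

where $g(y) = \mathrm{diag}(f_0(y), f_1(y), \ldots, f_N(y))$ is a diagonal matrix, and $\mathbf{f}(y) = (f_0(y), f_1(y), \ldots, f_N(y))^\top$ 
is a column vector. Note that $\Pi^k$ depends on the prior $\theta$ through $\Pi^0 = \theta$. Let $F_i(y)$ denote the distribution 
function corresponding to $f_i(y)$, and $\mathbf{F}(y) = (F_0(y), \ldots, F_N(y))^\top$. 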
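
The Bayes update is simple to state in code. The following minimal sketch assumes normal hypotheses with a common known variance and uses the parameters of Example 1 in Section III-C (means 45, 55, 60, variance 25, uniform prior, observations 58, 52, 41, 57). The helper names `normal_pdf`, `bayes_update` and `belief_path` are hypothetical and chosen only for illustration.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def bayes_update(belief, densities_at_y):
    """One step of Bayes' rule: posterior_i is proportional to belief_i * f_i(y)."""
    weighted = [p * f for p, f in zip(belief, densities_at_y)]
    total = sum(weighted)          # marginal density of y under the current belief
    return [w / total for w in weighted]

def belief_path(prior, means, sigma2, observations):
    """Belief vectors after each observation, starting from the prior."""
    path, belief = [], list(prior)
    for y in observations:
        belief = bayes_update(belief, [normal_pdf(y, mu, sigma2) for mu in means])
        path.append(belief)
    return path

path = belief_path([1 / 3, 1 / 3, 1 / 3], [45.0, 55.0, 60.0], 25.0, [58.0, 52.0, 41.0, 57.0])
for k, belief in enumerate(path, start=1):
    print(k, [round(p, 3) for p in belief])
```

The state carried from one period to the next is the full belief vector, so its length grows with the number of hypotheses; this is the source of the dimensionality problem addressed later.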

The general dynamic programming formulation uses the belief vector $\Pi^k \in S^N$ as the state variable. 
The state space is the $N$-dimensional belief space $S^N$, with the following optimality equation: 

$$V(\Pi) = \min\{L_0(\Pi), L_1(\Pi), \ldots, L_N(\Pi), L_w(\Pi)\}, \tag{2}$$

$$L_j(\Pi) = \sum_{i=0}^{N} \pi_i a_{ij}, \quad j = 0, \ldots, N,$$

$$L_w(\Pi) = \sum_{i=0}^{N} \pi_i c_i + \int_{y \in \mathcal{Y}} V(B(\Pi, y))\, \Pi\, d\mathbf{F}(y). \tag{3}$$


One can interpret the value function $V(\Pi)$ as the minimum expected cost to go given the current 
belief vector $\Pi \in S^N$. $L_j(\Pi)$ is the expected cost of accepting hypothesis $j$ immediately. $L_w(\Pi)$ is the 
expected cost of deciding to wait for one more period, incur the observation cost, collect a new observation 
$y \in \mathcal{Y}$ at the next period, and make optimal decisions onward. This optimality equation suffers from the 
curse of dimensionality in both the state space and the outcome space, because the belief space grows 
exponentially in $N$ and the integral in (3) is taken over a $D$-dimensional space. 

It has been shown that the solution to the optimality equation is unique [?], and the optimal policy 
is a stationary policy that chooses the action minimizing the right-hand side of (2). Let 
$\Gamma_j = \{\Pi \in S^N : L_j(\Pi) = V(\Pi)\}$ be the optimal acceptance region for hypothesis $j$. It is optimal to accept 
hypothesis $j$ as soon as the belief vector $\Pi^k$ enters this region, or wait if $\Pi^k \notin \Gamma_j$ for all $j \in \mathcal{N}$. One can 
implement the optimal policy by computing all the acceptance regions and comparing the belief vector 
against them to make decisions. We refer to this as the belief-vector procedure in this paper, as illustrated 
in the upper panel of Figure 3. Note that this procedure has two notable features: (1) the $\Gamma_j$'s do not change 
over time; (2) the $\Gamma_j$'s are independent of the prior $\theta$. Wald's SPRT and most asymptotically optimal policies 
such as [?], [13] also have similar features. In contrast, we will show that the optimal acceptance regions 
for the diagnostic sufficient statistic depend on both time and the prior. 
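TABLE I 

Univariate exponential families commonly used in practice 

| Distribution | p.d.f. (p.m.f.) | $\alpha$ | $\eta^\top(\alpha)$ | $t^\top(y)$ | $B(\alpha)$ | $M$ |
|---|---|---|---|---|---|---|
| Beta | $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y^{\alpha-1}(1-y)^{\beta-1}$ | $(\alpha, \beta)$ | $(\alpha, \beta)$ | $(\ln y, \ln(1-y))$ | $\ln\Gamma(\alpha) + \ln\Gamma(\beta) - \ln\Gamma(\alpha+\beta)$ | 2 |
| Binomial | $\binom{n}{y} p^y (1-p)^{n-y}$ | $p$ | $\ln\frac{p}{1-p}$ | $y$ | $-n\ln(1-p)$ | 1 |
| Chi-squared | $\frac{1}{2^{\nu/2}\Gamma(\nu/2)} y^{\nu/2-1} e^{-y/2}$ | $\nu$ | $\frac{\nu}{2} - 1$ | $\ln y$ | $\ln\Gamma(\frac{\nu}{2}) + \frac{\nu}{2}\ln 2$ | 1 |
| Exponential | $\lambda e^{-\lambda y}$ | $\lambda$ | $-\lambda$ | $y$ | $-\ln\lambda$ | 1 |
| Gamma | $\frac{\lambda^{\alpha}}{\Gamma(\alpha)} y^{\alpha-1} e^{-\lambda y}$ | $(\alpha, \lambda)$ | $(\alpha, -\lambda)$ | $(\ln y, y)$ | $\ln\Gamma(\alpha) - \alpha\ln\lambda$ | 2 |
| Geometric | $p(1-p)^y$ | $p$ | $\ln(1-p)$ | $y$ | $-\ln p$ | 1 |
| Laplace¹ | $\frac{1}{2b}\exp\left(-\frac{\lvert y-\mu \rvert}{b}\right)$ | $b$ | $-1/b$ | $\lvert y-\mu \rvert$ | $\ln 2b$ | 1 |
| Lognormal | $\frac{1}{y\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln y-\mu)^2}{2\sigma^2}\right)$ | $(\mu, \sigma^2)$ | $\left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ | $(\ln y, (\ln y)^2)$ | $\frac{\mu^2}{2\sigma^2} + \ln\sigma$ | 2 |
| Neg. binomial | $\binom{y+a-1}{y} p^{a}(1-p)^{y}$ | $(a, p)$ | $\ln(1-p)$ | $y$ | $-a\ln p$ | 1 |
| Normal | $\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$ | $(\mu, \sigma^2)$ | $\left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$ | $(y, y^2)$ | $\frac{\mu^2}{2\sigma^2} + \ln\sigma$ | 2 |
| Pareto | $\frac{\beta\alpha^{\beta}}{y^{\beta+1}}$ | $(\alpha, \beta)$ | $-\beta$ | $\ln y$ | $-\ln\beta - \beta\ln\alpha$ | 1 |
| Poisson | $\frac{\lambda^{y} e^{-\lambda}}{y!}$ | $\lambda$ | $\ln\lambda$ | $y$ | $\lambda$ | 1 |
| Rayleigh | $\frac{y}{\sigma^2}\exp\left(-\frac{y^2}{2\sigma^2}\right)$ | $\sigma$ | $-\frac{1}{2\sigma^2}$ | $y^2$ | $\ln\sigma^2$ | 1 |
| Weibull² | $\frac{\gamma}{\lambda}\left(\frac{y}{\lambda}\right)^{\gamma-1} e^{-(y/\lambda)^{\gamma}}$ | $\lambda$ | $-\frac{1}{\lambda^{\gamma}}$ | $y^{\gamma}$ | $\gamma\ln\lambda - \ln\gamma$ | 1 |

¹ with fixed mean 

² with fixed shape parameter $\gamma$ 
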


III. Proposed Solution Method 

We first provide a brief review of distributions in the exponential family, then describe the optimal 
solution procedure, and finally provide illustrative examples. 

Exponential families contain many important distributions widely used in practice. From now on, when 
we mention distribution, we refer to the distribution in an exponential family. 

Definition 1 (Exponential Family): Consider a random variable $Y$ that takes on values in some space 
$\mathcal{Y} \subset \mathbb{R}^D$. Let $f(y; \alpha)$ be a probability density (or mass) function parameterized by $\alpha$ from some parameter 
set $A$. The distribution is an $M$-parameter exponential family if 

$$f(y; \alpha) = h(y) \exp\left[\eta^\top(\alpha)\, t(y) - B(\alpha)\right], \quad y \in \mathcal{Y}, \tag{4}$$

for some underlying measure $h : \mathcal{Y} \to \mathbb{R}_+$, natural parameter vector $\eta : A \to \mathbb{R}^M$, natural sufficient 
statistic vector $t : \mathcal{Y} \to \mathbb{R}^M$, and normalization term $B : A \to \mathbb{R}$, where $\eta^\top$ denotes the transpose of $\eta$. 
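
As a concrete check of the form (4), the sketch below writes the normal distribution with natural parameter (mu/sigma^2, -1/(2 sigma^2)), natural sufficient statistic t(y) = (y, y^2), normalization term mu^2/(2 sigma^2) + ln(sigma), and base measure h(y) = 1/sqrt(2 pi). These are standard facts about the normal family; the function names are hypothetical.

```python
import math

def normal_natural_form(mu, sigma2):
    """Natural parameter eta and normalizer B of N(mu, sigma2), with t(y) = (y, y^2)."""
    eta = (mu / sigma2, -1.0 / (2.0 * sigma2))
    B = mu ** 2 / (2.0 * sigma2) + 0.5 * math.log(sigma2)   # equals mu^2/(2 sigma^2) + ln(sigma)
    return eta, B

def density_from_natural_form(y, eta, B):
    """Evaluate h(y) * exp(eta . t(y) - B) for the normal family."""
    h = 1.0 / math.sqrt(2.0 * math.pi)      # base measure of the normal family
    t = (y, y * y)                          # natural sufficient statistic
    return h * math.exp(eta[0] * t[0] + eta[1] * t[1] - B)

def normal_pdf(y, mu, sigma2):
    """Textbook normal density, used only as a reference."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

eta, B = normal_natural_form(55.0, 25.0)
print(density_from_natural_form(58.0, eta, B), normal_pdf(58.0, 55.0, 25.0))
```

The two printed numbers agree up to floating-point rounding, which confirms that the parameterization has M = 2.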

See Table I for some popular univariate distributions along with their parameters. Note that these 
distributions have only one or two parameters, i.e., M ≤ 2. Some univariate exponential families may 
have M > 2, although they are not as common in practice as those in Table I. The results of this paper 
are based on the following assumption: 

Assumption 1: The observations are independent and identically distributed (iid.), drawn from the same 
exponential family. 

The iid. assumption is standard in sequential decision models. As mentioned earlier, exponential families 
have also been extensively studied in sequential hypothesis testing and are widely used in practice. Thus, this 
assumption is standard in the literature and has important practical roots. 

We will show that a high-dimensional belief vector may be reduced to a low-dimensional diagnostic 
sufficient statistic (DSS) without loss of information. Consequently, the dimension of the state space can be 
significantly reduced. We identify the following two opportunities for dimension reduction (as illustrated 
in Figure 1): 

1) Opportunity 1: the existence of a natural sufficient statistic in exponential families. The belief vector 
is often chosen as the state variable of dynamic programming because it is a sufficient statistic of 
the observation-control history. We note that the sufficient statistic is not necessarily unique. For 
example, exponential families have their own sufficient statistic: the natural sufficient statistic. The 
dimension of the natural sufficient statistic is denoted by M. See Table I for some examples. The first 
opportunity for dimension reduction is that the N-dimensional belief vector can be reconstructed 
from the M-dimensional natural sufficient statistic. 

2) Opportunity 2: we can further reduce the dimensionality by exploiting the linear dependence 
(if any) in the natural parameters associated with multiple hypotheses. This opportunity is less 
intuitive but can be interpreted as follows: if the natural parameters of a group of hypotheses are 
linearly dependent, then the information that differentiates any two hypotheses can also be used to 
differentiate between other hypotheses in the group. This allows us to reduce the M-dimensional 
natural sufficient statistic further down to an r-dimensional diagnostic sufficient statistic. 

[Figure 1: N (dimension of belief space) → M (dimension of natural sufficient statistic) by Opportunity 1; 
M → r (diagnostic intrinsic dimension) by Opportunity 2.] 

Fig. 1. Two opportunities for dimension reduction 

We now describe the optimal solution method, which consists of two simple steps: belief vector 
reconstruction and the reformulation of the optimality equation. Each step is detailed below. 

A. Belief vector reconstruction 

Definition 2: Define the diagnostic matrix as 

$$H = \begin{bmatrix} \eta^\top(\alpha_1) - \eta^\top(\alpha_0) \\ \vdots \\ \eta^\top(\alpha_N) - \eta^\top(\alpha_0) \end{bmatrix},$$

where $\eta(\alpha_i)$ is the natural parameter vector for distribution $f_i$, $i \in \mathcal{N}$. 

The diagnostic matrix $H$ is an $N$-by-$M$ matrix, where $N$ is the number of hypotheses and $M$ is 
the number of parameters in the exponential family. Let $r = \mathrm{rank}(H)$ denote its rank. Clearly, 
$r \le \min\{N, M\}$. Since any matrix has a rank factorization, we can always find two matrices, $L$ and $U$, such 
that 

$$H = LU,$$

where $L$ is an $N$-by-$r$ matrix of full column rank and $U$ is an $r$-by-$M$ matrix of full row rank.¹ The 
full-row-rank matrix $U$ will be used to construct the diagnostic sufficient statistic: 
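
The rank of the diagnostic matrix is easy to compute numerically. The sketch below builds H for normal hypotheses and finds its rank by Gaussian elimination, using only the standard library. The helper names and the tolerance `tol` are assumptions of this illustration; the second parameter set is the one discussed in Section III-C2, and the third is a made-up contrast case.

```python
def normal_eta(mu, sigma2):
    """Natural parameter of N(mu, sigma2) for t(y) = (y, y^2)."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def diagnostic_matrix(etas):
    """Rows are eta(alpha_i) - eta(alpha_0) for i = 1, ..., N."""
    base = etas[0]
    return [[e - b for e, b in zip(eta, base)] for eta in etas[1:]]

def matrix_rank(rows, tol=1e-10):
    """Rank by Gaussian elimination with partial pivoting."""
    a = [list(r) for r in rows]
    rank, n_rows, n_cols = 0, len(a), len(a[0])
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < tol:
            continue                      # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for i in range(rank + 1, n_rows):
            factor = a[i][col] / a[rank][col]
            a[i] = [x - factor * p for x, p in zip(a[i], a[rank])]
        rank += 1
    return rank

# Equal variances, different means: second column of H is zero.
H_mean = diagnostic_matrix([normal_eta(m, 25.0) for m in (45.0, 55.0, 60.0)])
# Means (0, 1, 2, 3) with variances (1, 2, 3, 4): rows of H are collinear.
H_both = diagnostic_matrix([normal_eta(m, v) for m, v in zip((0.0, 1.0, 2.0, 3.0), (1.0, 2.0, 3.0, 4.0))])
# A hypothetical case without collinearity.
H_full = diagnostic_matrix([normal_eta(m, v) for m, v in zip((0.0, 1.0, 2.0), (1.0, 2.0, 5.0))])
print(matrix_rank(H_mean), matrix_rank(H_both), matrix_rank(H_full))
```

When the rank is one, a rank factorization can be read off directly: take L as any nonzero column of H scaled appropriately and U as the corresponding common row direction.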

Definition 3: Given a sequence of observations $Y_1, \ldots, Y_k$ following the distribution $f$ with natural 
sufficient statistic $t(\cdot)$, and the diagnostic matrix with rank factorization $H = LU$, we define the diagnostic 
sufficient statistic (DSS) as 

$$x^k = U \sum_{m=1}^{k} t(Y_m).$$

Note that $x^k$ is an $r$-dimensional vector, which can also be viewed as an $r$-dimensional projection of 
the cumulative sum of the natural sufficient statistic. We will refer to it as the DSS from now on. Note that it 
depends not only on the observations, but also on the natural parameters corresponding to all hypotheses. 
We use $\Pi^k(Y; \theta) = (\pi_0^k, \ldots, \pi_N^k)$ to denote the belief vector given the iid. observations $Y = (Y_1, \ldots, Y_k)$ 
and the prior belief $\theta = (\theta_0, \ldots, \theta_N)$. The following proposition suggests that the $(N+1)$-dimensional 
belief vector $\Pi^k(Y; \theta)$ can be reconstructed from the $r$-dimensional vector $x^k$. 

¹Although the rank factorization may not be unique, any choice serves our purpose. 
[Figure 2: two panels, the belief space and the diagnostic sufficient statistic (DSS) space.] 

Fig. 2. Illustration of the reachable belief space. 


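Proposition 1 (Belief vector reconstruction): Under Assumption 1, the belief vector $\Pi^k$ can be 
reconstructed from the DSS, $x^k$, through a mapping $T^k : \mathbb{R}^r \to S^N$. That is, $\Pi^k(Y; \theta) = T^k(x^k; \theta)$, 
where $T^k = (T_0^k, \ldots, T_N^k)$, 

$$T_0^k(x; \theta) = \left( \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\{D_i^k(x)\} + 1 \right)^{-1},$$

$$T_j^k(x; \theta) = \frac{\theta_j \exp\{D_j^k(x)\}}{\sum_{i=1}^{N} \theta_i \exp\{D_i^k(x)\} + \theta_0}, \quad j = 1, \ldots, N,$$

$D_j^k(x) = e_j L x - k\,[B(\alpha_j) - B(\alpha_0)]$, and $e_j$ is an $N$-dimensional unit row vector with 1 at the $j$th component. 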
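
The reason the reconstruction works can be traced in a few lines. The derivation below is a sketch assembled from Definitions 1 to 3, Bayes' rule and the factorization H = LU; it is offered as a reading aid and is not claimed to be the authors' proof.

```latex
\begin{align*}
\frac{\pi_j^k}{\pi_0^k}
  &= \frac{\theta_j \prod_{m=1}^{k} f_j(Y_m)}{\theta_0 \prod_{m=1}^{k} f_0(Y_m)}
  && \text{(Bayes' rule with iid.\ observations)} \\
  &= \frac{\theta_j}{\theta_0}
     \exp\Big\{ [\eta(\alpha_j) - \eta(\alpha_0)]^\top \sum_{m=1}^{k} t(Y_m)
                - k\,[B(\alpha_j) - B(\alpha_0)] \Big\}
  && \text{(form (4); the factors } h(Y_m) \text{ cancel)} \\
  &= \frac{\theta_j}{\theta_0}
     \exp\Big\{ e_j L\, U \sum_{m=1}^{k} t(Y_m) - k\,[B(\alpha_j) - B(\alpha_0)] \Big\}
  && \text{(row } j \text{ of } H = LU\text{)} \\
  &= \frac{\theta_j}{\theta_0} \exp\{ D_j^k(x^k) \}
  && \text{(Definition 3).}
\end{align*}
```

Imposing $\sum_{j=0}^{N} \pi_j^k = 1$ on these ratios gives the expressions for $T_0^k$ and $T_j^k$, so the observations enter the belief vector only through $x^k$ and $k$.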

Remark 1: Proposition 1 suggests that the belief vectors of interest can be represented by a DSS with 
the dimension $r = \mathrm{rank}(H) \le \min\{N, M\}$. Note in Table I that $M \le 2$ in many distributions commonly used 
in practice, meaning that the belief space reachable from the prior, $\theta$, has an intrinsic dimension of one or 
two, even when the number of hypotheses, $N$, is large. A geometric interpretation is given in Figure 2: the 
subspace of the belief space reachable in $k$ periods from $\theta$ is an $r$-dimensional manifold ($B_k$) embedded 
in the $N$-dimensional belief space. The nonlinear transformation $T^k$ "curls" the DSS space to form such a 
manifold with the same intrinsic dimension as itself. 


B. Reformulating the Optimality Equation 

Now we reformulate the original optimality equation (2) in the DSS space. Unlike the belief vector, the 
update of the DSS does not require the direct use of Bayes' theorem; it only involves a simple summation, 
namely, $x^{k+1} = x^k + U t(Y_{k+1})$, as implied by Definition 3. Consider a new value function 
$J^k(x; \theta) = V(T^k(x; \theta))$, defined by moving the time index $k$ out of the original value function $V(\cdot)$ and 
emphasizing $\Pi^k$'s dependence on $\theta$. The optimality equation involving $J^k(x; \theta)$ is given in the following corollary. 

TABLE II 

Belief-vector vs. DSS approach 

| | Belief-vector approach | DSS approach |
|---|---|---|
| Dimension of the state space | $N$ | $r = \mathrm{rank}(H) \le \min\{N, M\}$ |
| State variable | $\Pi^k$ | $x^k$ |
| Value function | $V(\Pi^k)$ | $J^k(x^k; \theta)$ |
| Stationary acceptance region | Yes | No |
| Prior-dependent acceptance region | No | Yes |
| Prior-dependent state | Yes | No |


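Corollary 1: Under Assumption 1, the optimality equation (2) can be reformulated using the DSS, $x$, 
as follows: 

$$J^k(x; \theta) = \min\{L_0^k(x; \theta), \ldots, L_N^k(x; \theta), L_w^k(x; \theta)\}, \tag{5}$$

$$L_j^k(x; \theta) = \sum_{i=0}^{N} T_i^k(x; \theta)\, a_{ij}, \quad j = 0, \ldots, N,$$

$$L_w^k(x; \theta) = \sum_{i=0}^{N} T_i^k(x; \theta)\, c_i + \int_{y \in \mathcal{Y}} J^{k+1}(x + U t(y); \theta) \sum_{i=0}^{N} T_i^k(x; \theta)\, dF_i(y).$$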

Remark 2: The function $J^k(x; \theta)$ is defined on the $r$-dimensional state space where $x$ resides. Since $r \le 2$ 
in many real settings listed in Table I, we can solve the (one- or two-dimensional) problem with value 
iteration. When the distribution is continuous, some discretization of the DSS space will be required. To 
perform the value iteration, one will need to use the truncation method described in [17], [?]. That is, we 
first consider a finite-horizon problem, solve the optimality equation by backward induction, then increase 
the length of the horizon to an extent such that the value function converges (the convergence is guaranteed 
by the dominated convergence theorem). Note also that $J^k(x; \theta)$ explicitly depends on both time $k$ and 
the prior $\theta$ (as opposed to the original value function $V(\Pi)$, which is independent of $k$ and $\theta$). Accordingly, 
the acceptance regions become non-stationary and prior-dependent. We compare and contrast these two 
methods in Table II, although the decisions generated by them are the same. 
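
The truncation idea can be illustrated on a case where no discretization is needed. The sketch below assumes Bernoulli hypotheses (a binomial with n = 1), for which the DSS is the integer count of successes, and runs backward induction over a finite horizon with stopping forced at the last period. All names and the numerical parameters are hypothetical, and the code is a simplified illustration of the recursion in Corollary 1 rather than the implementation used in the paper.

```python
import math

def beliefs_from_dss(x, k, prior, ps):
    """T^k(x; theta) for Bernoulli hypotheses; x = number of successes in k draws."""
    eta = [math.log(p / (1.0 - p)) for p in ps]      # natural parameters
    B = [-math.log(1.0 - p) for p in ps]             # normalization terms
    w = [th * math.exp(x * (e - eta[0]) - k * (b - B[0]))
         for th, e, b in zip(prior, eta, B)]
    total = sum(w)
    return [wi / total for wi in w]

def solve_truncated(prior, ps, c, a, horizon):
    """Backward induction on the scalar DSS; stopping is forced at k = horizon.

    Returns J[k][x] (cost to go) and act[k][x] ("wait" or the accepted index j).
    """
    n = len(ps)
    J = [[0.0] * (k + 1) for k in range(horizon + 1)]
    act = [[None] * (k + 1) for k in range(horizon + 1)]
    for k in range(horizon, -1, -1):
        for x in range(k + 1):
            pi = beliefs_from_dss(x, k, prior, ps)
            stop = [sum(pi[i] * a[i][j] for i in range(n)) for j in range(n)]
            best_j = min(range(n), key=lambda j: stop[j])
            J[k][x], act[k][x] = stop[best_j], best_j
            if k < horizon:
                p_one = sum(pi[i] * ps[i] for i in range(n))   # P(next Y = 1 | x, k)
                wait = (sum(pi[i] * c[i] for i in range(n))
                        + p_one * J[k + 1][x + 1] + (1.0 - p_one) * J[k + 1][x])
                if wait < J[k][x]:
                    J[k][x], act[k][x] = wait, "wait"
    return J, act

prior = [0.3, 0.4, 0.3]                   # hypothetical prior
ps = [0.3, 0.5, 0.7]                      # hypothetical success probabilities
c = [0.01, 0.01, 0.01]                    # hypothetical observation costs
a = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]     # hypothetical 0-1 termination costs
J, act = solve_truncated(prior, ps, c, a, horizon=60)
print(J[0][0], act[0][0])
```

Increasing `horizon` until `J[0][0]` stops changing mirrors the truncation procedure described in Remark 2. The table `act[k]` gives the acceptance and waiting intervals on the DSS at each period, which change with both `k` and `prior`.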

C. Applications to Open Problems 

We now apply the DSS-based approach to test hypotheses concerning the normal distribution. 
Various suboptimal procedures have been developed for these seemingly basic problems, such as [?], 
[8], [?], reviewed by [20]. However, no procedure is yet known to be both optimal and 
scalable to a large number of hypotheses. 

1) Testing the mean of normal distribution: We begin with a standard problem of testing simple 
hypotheses about the mean of a normal distribution, assuming that the variance is known. Suppose that 
independent scalar observations $y = \{Y_1, \ldots, Y_k\}$ are sequentially drawn from one of $N + 1$ univariate 
normal distributions $f_i = N(\mu_i, \sigma^2)$, $i = 0, \ldots, N$, differing in the mean. We are concerned with the 
hypotheses $H_i : Y_k \sim N(\mu_i, \sigma^2)$, $i = 0, \ldots, N$. We have obtained the prior belief vector $\theta = (\theta_0, \ldots, \theta_N)$ 
reflecting our initial knowledge. The first step is to find the diagnostic matrix 

$$H = \begin{bmatrix} \eta^\top(\alpha_1) - \eta^\top(\alpha_0) \\ \vdots \\ \eta^\top(\alpha_N) - \eta^\top(\alpha_0) \end{bmatrix} = \begin{bmatrix} (\mu_1 - \mu_0)/\sigma^2 & 0 \\ \vdots & \vdots \\ (\mu_N - \mu_0)/\sigma^2 & 0 \end{bmatrix}.$$

[Figure 3: upper panel, belief-vector approach; lower panel, DSS approach with non-stationary acceptance 
intervals, both for the sequential data $y = \{y_1, y_2, y_3, y_4\}$. Legend: Accept $H_0$, Accept $H_1$, Accept $H_2$, Wait.] 

Fig. 3. Comparison of the belief-vector approach and DSS approach in Example 1. 


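Clearly, the rank of this matrix is one, i.e., $r = 1$. A rank factorization of $H$ is 

$$L = \left( \frac{\mu_1 - \mu_0}{\sigma^2}, \ldots, \frac{\mu_N - \mu_0}{\sigma^2} \right)^\top, \quad U = (1, 0).$$

The corresponding DSS is the cumulative sum of observations, $x^k = \sum_{m=1}^{k} y_m$. The transformation 
$T_j^k$, $j = 0, \ldots, N$, can be specialized as 

$$T_0^k(x; \theta) = \left( \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\left\{ x\,\frac{\mu_i - \mu_0}{\sigma^2} - k\,\frac{\mu_i^2 - \mu_0^2}{2\sigma^2} \right\} + 1 \right)^{-1},$$

$$T_j^k(x; \theta) = T_0^k(x; \theta)\, \frac{\theta_j}{\theta_0} \exp\left\{ x\,\frac{\mu_j - \mu_0}{\sigma^2} - k\,\frac{\mu_j^2 - \mu_0^2}{2\sigma^2} \right\}, \quad j = 1, \ldots, N.$$

The optimality equation can be obtained by specializing (5) using $T^k(x)$, defined above, the natural 
sufficient statistic $t(y) = (y, y^2)^\top$, and the corresponding normal distribution function $F_i(y)$. 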


Example 1: Illustrative Example 

Consider the example with the following parameters: $(\mu_0, \mu_1, \mu_2) = (45, 55, 60)$, $\sigma^2 = 25$, 
$\Pi^0 = (\theta_0, \theta_1, \theta_2) = (1/3, 1/3, 1/3)$, $(c_0, c_1, c_2) = (0.5, 0.2, 0.3)$, $a_{01} = 2$, $a_{02} = 5$, $a_{10} = 3$, $a_{12} = 6$, 
$a_{20} = 4$, $a_{21} = 7$, $a_{00} = a_{11} = a_{22} = 0$. The simple hypotheses to be tested are $H_0 : Y_k \sim N(45, 25)$, 
$H_1 : Y_k \sim N(55, 25)$, $H_2 : Y_k \sim N(60, 25)$. Four independent realizations of $Y_k$, $\{y_1, y_2, y_3, y_4\} = \{58, 52, 41, 57\}$, 
are sequentially generated from the distribution $N(55, 25)$. 

We first describe the classical belief-vector approach. Based on the prior and sequential observations, 
we find the posterior belief vectors in sequence: $\Pi^1 = (0.019, 0.466, 0.515)$, $\Pi^2 = (0.013, 0.721, 0.266)$, 
$\Pi^3 = (0.399, 0.593, 0.008)$, $\Pi^4 = (0.039, 0.949, 0.012)$. The first three belief vectors lie in the waiting 
region but the fourth vector falls in the acceptance region for hypothesis $H_1$, as shown in Figure 3 (upper 
panel). Clearly, the sample path depends on the prior, but the acceptance regions do not and they remain 
fixed over time. 

[Figure 4: (a) prior $\theta_j$ and posterior $\pi_j^k$ plotted against the hypothesis index $j$; (b) acceptance and 
waiting intervals on the DSS.] 

Fig. 4. Example 2: (a) belief vectors and (b) acceptance intervals on DSS. 


Next, we describe the proposed solution approach. The DSS sequence is $x^1 = 58$, $x^2 = 110$, $x^3 = 151$, 
$x^4 = 208$. We compare each of them with the acceptance intervals shown in Figure 3 (lower panel) 
and also find that it is optimal to wait until the fourth period. The sequence of actions is, as expected, 
the same as in the belief-vector approach. But the decision process now takes place in one dimension. Note that 
the DSS is independent of the prior, but the acceptance intervals depend on both the prior and time. 
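
The numbers in Example 1 can be reproduced from the scalar DSS alone. The sketch below keeps only the running sum of observations and the period index, and rebuilds the belief vector with the specialized transformation for the normal mean. The function name `beliefs_from_dss_normal_mean` and the variable names are hypothetical.

```python
import math

def beliefs_from_dss_normal_mean(x, k, prior, means, sigma2):
    """Belief vector from the scalar DSS x = y_1 + ... + y_k (common known variance)."""
    mu0 = means[0]
    # D_j = x (mu_j - mu_0) / sigma^2 - k (mu_j^2 - mu_0^2) / (2 sigma^2), with D_0 = 0
    weights = [th * math.exp(x * (mu - mu0) / sigma2
                             - k * (mu ** 2 - mu0 ** 2) / (2.0 * sigma2))
               for th, mu in zip(prior, means)]
    total = sum(weights)
    return [w / total for w in weights]

prior = [1 / 3, 1 / 3, 1 / 3]
means = [45.0, 55.0, 60.0]
sigma2 = 25.0
ys = [58.0, 52.0, 41.0, 57.0]

dss_path, beliefs = [], []
x = 0.0
for k, y in enumerate(ys, start=1):
    x += y                                   # DSS update: a running sum, no Bayes step
    dss_path.append(x)
    beliefs.append(beliefs_from_dss_normal_mean(x, k, prior, means, sigma2))
    print(k, x, [round(p, 3) for p in beliefs[-1]])
```

Only the pair (k, x) is stored between periods, regardless of how many hypotheses are being tested; the prior enters when the belief vector, and hence the acceptance intervals, are computed.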

Remark 3: To implement this approach in the real world, it is often desirable to use the sample average 
$(\sum_{m=1}^{k} y_m)/k$ as an equivalent to the DSS so that the size of the state space does not increase with $k$. We 
note that the sufficient statistic $\sum_{m=1}^{k} y_m$ has been used by many heuristic methods such as the Sobel-Wald 
procedure [?], [?]. However, the decision rules of the optimal policy differ sharply 
from these heuristic methods. For example, the Sobel-Wald procedure combines multiple SPRTs [?], 
whereas the optimal method only requires a single integrated test. The Billard-Vagholkar procedure [?] 
prohibits accepting any hypothesis in the first few periods, but the optimal policy allows stopping even at 
the first period. 

Example 2: Flexible priors 

The proposed method is compatible with arbitrary nonzero prior beliefs. We will illustrate such flexibility 
using the following example with ten hypotheses, a size that cannot be efficiently solved by the belief- 
vector approach. 

The parameters in this example are /ij = 40 + 5f, af = = 100, q = 0.02 — O.Olf, aij = \i — j\+ 
0.5[max(j — i,0)]^, for i = 0,..., 9. To illustrate the flexibility on prior selection, we choose a prior 
distribution involving a trigonometric function Oi = (sin(f) + 1.5)/16.4 for f = 0,...,9, as shown in 
Figure |^a. This bimodal prior implies that Hi and Hy are the most likely whereas and are the 
least likely. After collecting a series of observations, the posterior probabilities for k = 1,5,10,15, 21 
are shown in the same figure. Figure |^b illustrates the optimal acceptance intervals on the DSS x^ for 
each k. These intervals are similar to those in Figure Indeed, increasing the number of hypotheses 
no longer requires going to higher dimensions; it only adds more intervals to this chart. It is clear
that the waiting intervals gradually shrink as k increases, because the uncertainty decreases as
information accumulates.

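2) Testing both mean and variance of normal distribution: When we are testing both the mean and
variance, the cumulative sum of observations ($y$) may not be sufficient because the variance information
is often better captured by the squared observations ($y^2$). Suppose that we sequentially observe $Y^k =
\{Y_1, \ldots, Y_k\}$ drawn from one of $N + 1 > 2$ normal distributions $f_i = N(\mu_i, \sigma_i^2)$, $i = 0, \ldots, N$. The goal
is to find the true distribution by testing the hypotheses $H_i : Y_k \sim N(\mu_i, \sigma_i^2)$, $i = 0, \ldots, N$. As before, we
first find the diagnostic matrix

$$
H =
\begin{bmatrix}
\eta^T(\alpha_1) - \eta^T(\alpha_0) \\
\vdots \\
\eta^T(\alpha_N) - \eta^T(\alpha_0)
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\sigma_0^2 \mu_1 - \sigma_1^2 \mu_0}{\sigma_0^2 \sigma_1^2} & \dfrac{\sigma_1^2 - \sigma_0^2}{2 \sigma_0^2 \sigma_1^2} \\
\vdots & \vdots \\
\dfrac{\sigma_0^2 \mu_N - \sigma_N^2 \mu_0}{\sigma_0^2 \sigma_N^2} & \dfrac{\sigma_N^2 - \sigma_0^2}{2 \sigma_0^2 \sigma_N^2}
\end{bmatrix}.
$$

Next, we define $\zeta_i = (\sigma_0^2 \mu_i - \sigma_i^2 \mu_0)/(\sigma_i^2 - \sigma_0^2)$ and examine two cases: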

Case I: $\zeta_i$'s are identical: If $\zeta_i = \zeta$ for $i = 1, \ldots, N$, the matrix $H$ has rank one ($r = 1$). Thus,
the problem is in one dimension. This example illustrates a case where the two dimension-reduction
opportunities (see Figure) are both present. Although the dimension of the distribution is two ($M = 2$),
the intrinsic dimension of the reachable belief space is still one. That is, $r < M$. A rank factorization
is

$$
L = \left( \frac{\sigma_0^2 \mu_1 - \sigma_1^2 \mu_0}{\sigma_0^2 \sigma_1^2}, \ldots, \frac{\sigma_0^2 \mu_N - \sigma_N^2 \mu_0}{\sigma_0^2 \sigma_N^2} \right)', \qquad U = \left( 1, \frac{1}{2\zeta} \right).
$$

The DSS is a scalar given by

$$
x^k = U \sum_{m=1}^{k} t(Y_m) = \sum_{m=1}^{k} \left( Y_m + \frac{Y_m^2}{2\zeta} \right).
$$

Clearly, identical $\zeta_i$'s can arise when the means are identical but the variances differ ($\mu_i = \mu_0 = -\zeta$), or
when the means are different but the variances are identical (e.g., Examples 1-2). However, it is important
to note that identical $\zeta_i$'s can also appear when both the mean and the variance are different, for instance, when
$(\mu_0, \mu_1, \mu_2, \mu_3) = (0, 1, 2, 3)$ and $(\sigma_0^2, \sigma_1^2, \sigma_2^2, \sigma_3^2) = (1, 2, 3, 4)$, in which case the rank of the matrix $H$ is still one,
and the corresponding DSS is still a scalar given by

$$
x^k = \sum_{m=1}^{k} \left( Y_m + Y_m^2/2 \right).
$$

But this DSS is far less intuitive than in the previous cases.
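
The rank-one claim for the numerical instance above can be verified directly. The sketch below builds the diagnostic matrix from the natural parameters of the normal family, $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$ with $t(y) = (y, y^2)$, and checks the rank and the factorization; the helper name `diagnostic_matrix` is hypothetical.

```python
import numpy as np

def diagnostic_matrix(mu, var):
    """Rows are eta(alpha_i) - eta(alpha_0) for N(mu_i, var_i), i = 1..N."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    # natural parameters of the normal family for t(y) = (y, y^2)
    eta = np.column_stack([mu / var, -1.0 / (2.0 * var)])
    return eta[1:] - eta[0]

# The instance quoted in the text: means (0,1,2,3), variances (1,2,3,4)
mu = np.array([0.0, 1.0, 2.0, 3.0])
var = np.array([1.0, 2.0, 3.0, 4.0])

H = diagnostic_matrix(mu, var)
zeta = (var[0] * mu[1:] - var[1:] * mu[0]) / (var[1:] - var[0])
rank = np.linalg.matrix_rank(H)

# Rank factorization H = L U with U = (1, 1/(2 zeta))
L = H[:, [0]]
U = np.array([1.0, 1.0 / (2.0 * zeta[0])])
print(rank, zeta, np.allclose(L @ U[None, :], H))
```

Here every $\zeta_i$ equals one, so $U = (1, 1/2)$ and the scalar DSS is the sum of $Y_m + Y_m^2/2$.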

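Case II: $\zeta_i$'s are non-identical: If the $\zeta_i$'s are different, we have $r = 2$. A rank factorization is $L = H$, $U = I$
(the identity matrix). In this case, the DSS is $x^k = \left( \sum_{m=1}^{k} Y_m, \sum_{m=1}^{k} Y_m^2 \right)'$. The
transformation $T^k(x; \theta)$ can be specialized as

$$
T_0^k(x_1, x_2; \theta) = \left[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\left\{ \frac{\sigma_0^2 \mu_i - \sigma_i^2 \mu_0}{\sigma_0^2 \sigma_i^2} x_1 + \frac{\sigma_i^2 - \sigma_0^2}{2 \sigma_0^2 \sigma_i^2} x_2 - k \left[ \frac{\sigma_0^2 \mu_i^2 - \sigma_i^2 \mu_0^2}{2 \sigma_0^2 \sigma_i^2} + \ln\left( \frac{\sigma_i}{\sigma_0} \right) \right] \right\} + 1 \right]^{-1},
$$

$$
T_j^k(x_1, x_2; \theta) = \frac{\theta_j}{\theta_0} \exp\left\{ \frac{\sigma_0^2 \mu_j - \sigma_j^2 \mu_0}{\sigma_0^2 \sigma_j^2} x_1 + \frac{\sigma_j^2 - \sigma_0^2}{2 \sigma_0^2 \sigma_j^2} x_2 - k \left[ \frac{\sigma_0^2 \mu_j^2 - \sigma_j^2 \mu_0^2}{2 \sigma_0^2 \sigma_j^2} + \ln\left( \frac{\sigma_j}{\sigma_0} \right) \right] \right\}
$$
$$
\qquad \times \left[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\left\{ \frac{\sigma_0^2 \mu_i - \sigma_i^2 \mu_0}{\sigma_0^2 \sigma_i^2} x_1 + \frac{\sigma_i^2 - \sigma_0^2}{2 \sigma_0^2 \sigma_i^2} x_2 - k \left[ \frac{\sigma_0^2 \mu_i^2 - \sigma_i^2 \mu_0^2}{2 \sigma_0^2 \sigma_i^2} + \ln\left( \frac{\sigma_i}{\sigma_0} \right) \right] \right\} + 1 \right]^{-1}, \quad j = 1, \ldots, N.
$$

We used the notation $x = (x_1, x_2)$ in the above expressions, where $x_1 = \sum_{m=1}^{k} y_m$ and $x_2 = \sum_{m=1}^{k} y_m^2$. The
two-dimensional form of the finite-horizon optimality equation can be obtained by specializing the general optimality equation using
the $T^k(x)$ defined as above, $t(y) = (y, y^2)'$, as well as the corresponding normal distribution functions
$F_i(y)$.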
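
To illustrate Case II, the following sketch rebuilds the posterior from the two-dimensional statistic $(x_1, x_2)$ and the period $k$, and compares it with a direct application of Bayes' rule to the full sample. The parameter values are illustrative assumptions (they are not those of Example 3), and the function names are hypothetical.

```python
import numpy as np

def posterior_direct(y, mu, var, prior):
    """Posterior from the full sample under N(mu_i, var_i) hypotheses."""
    y = np.asarray(y, dtype=float)[:, None]
    logpdf = -0.5 * np.log(2.0 * np.pi * var) - (y - mu) ** 2 / (2.0 * var)
    w = np.log(prior) + logpdf.sum(axis=0)
    w = np.exp(w - w.max())
    return w / w.sum()

def posterior_from_dss(x1, x2, k, mu, var, prior):
    """Posterior rebuilt from x1 = sum(y), x2 = sum(y^2) and k only."""
    coef1 = mu / var - mu[0] / var[0]                  # first column of H (entry 0 is zero)
    coef2 = 1.0 / (2.0 * var[0]) - 1.0 / (2.0 * var)   # second column of H
    b = mu**2 / (2.0 * var) + 0.5 * np.log(var)        # log-partition B(alpha_i)
    d = coef1 * x1 + coef2 * x2 - k * (b - b[0])       # exponent D_i^k(x)
    w = np.log(prior) + d
    w = np.exp(w - w.max())
    return w / w.sum()

# Illustrative parameters (assumptions): three hypotheses with distinct zeta_i
mu = np.array([30.0, 33.0, 37.0])
var = np.array([70.0, 80.0, 95.0])
prior = np.array([0.2, 0.5, 0.3])

rng = np.random.default_rng(3)
y = rng.normal(33.0, np.sqrt(80.0), size=5)

p_full = posterior_direct(y, mu, var, prior)
p_dss = posterior_from_dss(y.sum(), (y**2).sum(), len(y), mu, var, prior)
print(np.allclose(p_full, p_dss))
```

The state of the dynamic program can therefore be taken as $(x_1, x_2, k)$ regardless of how many hypotheses are tested.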
Fig. 5. The acceptance regions for Example 3.


Example 3 

Consider five hypotheses about normal distributions $f_i$, $i = 0, \ldots, 4$, differing in both mean and variance.
For the distribution /,, the mean is given by /ij = 30 + 4(z — 1)^/^ and the varianee follows the expression 
af = 74: — i. The prior is a uniform distribution given by 9i = 1/5. The eosts are q = 0.02 + O.Oli 
and ttij = \i — j |/6 + [max(i — j, 0)]^/12. The DSS spaee has a dimension r{H) = 2. Examples of the 
acceptance regions are shown in Figure 5 for $k = 1$ and $k = 4$. The horizontal axis is the cumulative
sum of observations $x_1 = \sum_{m=1}^{k} y_m$; the vertical axis is the cumulative sum of squared observations
$x_2 = \sum_{m=1}^{k} y_m^2$.


IV. Comparison with MSPRT 


From a practical point of view, it is important to know the magnitude of improvement that the
optimal policy can provide over existing suboptimal policies. Recent developments are mainly based on
asymptotically optimal policies; a good benchmark is the M-ary sequential probability ratio test (MSPRT).
This procedure is known to be asymptotically optimal as the observation costs (or identification
errors) approach zero.

Consider the case with three hypotheses about the mean of a normal distribution, namely, $H_i : Y_k \sim
N(\mu_i, \sigma^2)$, $i = 0, 1, 2$. Suppose that the observation costs are identical, i.e., $c_i = c$, and the termination costs
are zero-one, namely, $a_{ij} = 1$ if $i \neq j$ and $a_{ij} = 0$ if $i = j$. In this context, the MSPRT defines a series of
Markov times $\tau_i = \inf\{k : \pi_i^k > A_i\}$, where $\pi_i^k$ is the posterior probability for hypothesis $i$, and $A_i$ is
the corresponding constant threshold. The MSPRT stopping time is defined as the minimum Markov time
$\tau = \min_i \{\tau_i\}$, and the acceptance decision rule is: $d = i$ if $\tau = \tau_i$.
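
The MSPRT rule described above can be sketched as a short simulation. This is only an illustration under stated assumptions: normal observations with a common standard deviation, a cap `max_steps` on the horizon, and a tie-breaking rule (largest posterior) for the case where several thresholds are crossed at once. None of these choices, nor the function name `msprt`, comes from the paper.

```python
import numpy as np

def msprt(mu, sigma, prior, thresholds, true_idx, rng, max_steps=10_000):
    """Run one MSPRT path; return (stopping time, accepted index, posterior)."""
    mu = np.asarray(mu, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    logpost = np.log(np.asarray(prior, dtype=float))
    for k in range(1, max_steps + 1):
        y = rng.normal(mu[true_idx], sigma)
        # Bayes update in log space, then normalise
        logpost = logpost - 0.5 * ((y - mu) / sigma) ** 2
        logpost = logpost - np.logaddexp.reduce(logpost)
        post = np.exp(logpost)
        hit = np.flatnonzero(post > thresholds)   # hypotheses with pi_i^k > A_i
        if hit.size > 0:
            # tau = min_i tau_i; ties broken by the largest posterior (assumption)
            return k, int(hit[np.argmax(post[hit])]), post
    return max_steps, int(np.argmax(post)), post

rng = np.random.default_rng(1)
tau, d, post = msprt(mu=[0.0, 1.0, 2.0], sigma=1.0,
                     prior=[1 / 3, 1 / 3, 1 / 3],
                     thresholds=[0.9, 0.9, 0.9],
                     true_idx=1, rng=rng)
print(tau, d, post.round(3))
```

Because each threshold acts on a single posterior component, the stopping boundaries are parallel to the faces of the belief simplex, which is the structural feature the comparison in this section examines.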

We perform simulation studies to compare the performance of MSPRT with that of the optimal policy for
different combinations of observation cost $c$ and means $\mu_i$. In this simulation experiment, we enumerate
all combinations of MSPRT thresholds and choose the minimum cost as the benchmark. The simulation is
run long enough so that the width of the 95% confidence interval for the estimated average cost is less than
0.001. The estimated average costs and the MSPRT's percentage of loss from optimal are shown in Table III.

We observe in Table III that the sub-optimality of MSPRT becomes larger when the observation cost
$c$ is larger, or when the differences in the means $\mu_i$ become smaller. Such observations are consistent
with asymptotic optimality. They suggest that the optimal policy is more desirable when the hypotheses
are more difficult to differentiate from one another, or when we cannot afford to take many observations
for fear of increasing the response delay time. Nevertheless, MSPRT gives a good approximation when the
hypotheses are relatively easy to differentiate and when the observation cost is low. In these situations,
one may argue that MSPRT serves as a satisfactory alternative.

TABLE III

Comparison of the total cost between the optimal policy and MSPRT.

$(\mu_1 - \mu_0)/\sigma = 0.2$, $(\mu_2 - \mu_0)/\sigma = 0.4$
c            0.5      0.4      0.3      0.2      0.1
Optimal      1.328    1.232    1.140    1.027    0.936
MSPRT        6.749    5.534    4.329    3.240    1.999
Error (%)    408.2    349.2    279.7    215.5    113.5

$(\mu_1 - \mu_0)/\sigma = 0.4$, $(\mu_2 - \mu_0)/\sigma = 0.6$
c            0.5      0.4      0.3      0.2      0.1
Optimal      1.344    1.230    1.147    1.045    0.945
MSPRT        4.181    3.315    2.821    2.140    1.452
Error (%)    211.0    169.5    145.9    104.8    53.70

$(\mu_1 - \mu_0)/\sigma = 0.8$, $(\mu_2 - \mu_0)/\sigma = 1.4$
c            0.5      0.4      0.3      0.2      0.1
Optimal      1.332    1.199    1.101    1.023    0.957
MSPRT        1.803    1.582    1.439    1.185    1.029
Error (%)    35.39    31.99    30.84    15.89    7.497

$(\mu_1 - \mu_0)/\sigma = 1.0$, $(\mu_2 - \mu_0)/\sigma = 2.0$
c            0.5      0.4      0.3      0.2      0.1
Optimal      1.280    1.161    1.120    0.980    0.976
MSPRT        1.287    1.167    1.124    0.981    0.977
Error (%)    0.546    0.516    0.330    0.101    0.092
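
The Error (%) rows of Table III appear to be the relative excess cost of MSPRT over the optimal policy. The short check below, which assumes that definition, recomputes the first block of the table from the rounded costs; small discrepancies in the last digit are expected because the published costs are themselves rounded.

```python
# First block of Table III: (mu_1 - mu_0)/sigma = 0.2, (mu_2 - mu_0)/sigma = 0.4
optimal = [1.328, 1.232, 1.140, 1.027, 0.936]
msprt_cost = [6.749, 5.534, 4.329, 3.240, 1.999]
reported = [408.2, 349.2, 279.7, 215.5, 113.5]

# Assumed definition: percentage loss = 100 * (MSPRT - optimal) / optimal
loss_pct = [100.0 * (m - o) / o for o, m in zip(optimal, msprt_cost)]

for c, got, ref in zip([0.5, 0.4, 0.3, 0.2, 0.1], loss_pct, reported):
    print(f"c = {c}: recomputed {got:.1f}%, reported {ref}%")
```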

Incidentally, for the cases in Table III, the computation time of the optimal policy ranges from 17.9 to
24.6 seconds in the MATLAB™ environment on a desktop computer with two 3.4 GHz Intel Core i7
processors. This is far from prohibitive for applications of hypothesis testing.

The poor performance of MSPRT seen in some situations has intuitive explanations. First, the existence
of a “minimum waiting region”, as shown in Figure 6a, can increase the delay time when the optimal
waiting region is actually small. Second, the MSPRT cannot fully capture the sequential nature of the
problem. This can be illustrated by a simple example as follows: Consider the observation densities $f_i$,
as shown in Figure 6b. Let $c = 0.1$ and


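$$
\begin{bmatrix}
0 & a_{01} & a_{02} \\
a_{10} & 0 & a_{12} \\
a_{20} & a_{21} & 0
\end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 0 & 0
\end{bmatrix}.
$$
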

In this setting, the misidentification cost is zero if we mistake $f_1$ for $f_2$, and vice versa; it is always one if
we mistake $f_0$ for $f_1$ (or $f_2$), and vice versa. That is, the two hypotheses corresponding to $f_1$ and $f_2$ are
identical in cost. By intuition, one may suggest grouping the two hypotheses as one hypothesis and working
with the sum of probabilities $\pi_1 + \pi_2$ or, equivalently, $\pi_0$. If we do so, the corresponding decision region
will appear to be triangular; its boundary will be a line segment parallel to the boundary of the belief
space (an example is line segment AB, as shown in Figure 6c). As mentioned earlier, parallel boundaries
of this sort are the essence of MSPRT, which sets the threshold on an individual posterior probability (in this
case, $\pi_0$). Under this policy, points A and B should have the same action. However, the optimal actions
are actually different for the two points. To see why this is true, we should realize that $f_2$ is easier to
differentiate from $f_0$ than $f_1$ is, so by waiting at point A, it is more likely to obtain a new
observation with a higher discriminating power (since the belief vector at point A indicates that $f_2$ is more
likely). In other words, the expected value of new information is higher at point A; hence, one should
be more “patient” and wait longer. However, this discrepancy between points A and B is lost in the sum
$\pi_1 + \pi_2$. If we were to make a one-shot decision during standard (non-sequential) hypothesis testing, there
would be no need to dissect such details within the sum. However, this subtlety is crucial in the sequential
environment and can only be captured by the optimal policy.

(a) (b) (c)

Fig. 6. The differences between MSPRT and optimal policy.


V. The Sampling Control Problem 


We now extend the main results to the sequential multi-hypothesis testing problem with sampling
control, in which one can adaptively choose among multiple alternative sampling modes with different
diagnostic powers and costs. This subject was initiated by [24] and remains a vibrant area of research.

Consider a set of hypotheses denoted by $H_i$, $i \in \mathcal{N}$, among which only one is true. The decision maker
is interested in finding the true hypothesis by conducting sequential sampling and observation. At the
decision epoch $k$, the decision maker can either accept one of the hypotheses and terminate the decision
process, or choose a sampling mode $a_k$ from the sampling action set $A = \{1, \ldots, K\}$. When the true
hypothesis is $H_i$, the action $a \in A$ generates an observation $Y_k \in \mathcal{Y}$ with probability density (or
mass) function $f_i^a$. We assume that the functions $f_i^a$, $a \in A$, $i \in \mathcal{N}$, are known and that the observations $Y_k$
are independent conditional on the action and hypothesis. A sequential policy $\delta = (\tau, A^\tau, d)$ contains
the stopping time $\tau$, the sequential sampling actions $A^\tau = \{a_1, \ldots, a_{\tau-1}\}$, and the acceptance decision
rule $d : A^\tau \times \{Y_1, \ldots, Y_{\tau-1}\} \to \mathcal{N}$. Let $\theta = (\theta_0, \theta_1, \ldots, \theta_N)$ be the prior belief vector.

Suppose that $f_i^a$ belongs to an exponential family with the natural parameter $\eta$ and the natural sufficient
statistic $t$, namely, $f_i^a(y) = f(y; \alpha_i^a) = h(y) \exp\left[ \eta^T(\alpha_i^a) t(y) - B(\alpha_i^a) \right]$. Let $Y^k = (Y_1, \ldots, Y_k)$ be the
observation sequence generated by the sampling-mode sequence $A^k = (a_1, \ldots, a_k) \in A^k$. Let $\Omega_a = \{m \in
\{1, \ldots, k\} : a_m = a\}$ be the set of decision periods (up to time $k < \tau$) at which the sampling mode $a \in A$
is used. Further, let $k_a \in \mathbb{N}$ be the cardinality of the set $\Omega_a$, representing the total number of times of using
the sampling mode $a \in A$ before stopping. Clearly, $\bigcup_{a \in A} \Omega_a = \{1, \ldots, k\}$ and $\sum_{a \in A} k_a = k$ for $k < \tau$.

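Define an $N \times (MK + K)$ diagnostic matrix as follows

$$
H_s =
\begin{bmatrix}
\eta^T(\alpha_1^1) - \eta^T(\alpha_0^1) & \cdots & \eta^T(\alpha_1^K) - \eta^T(\alpha_0^K) & B(\alpha_0^1) - B(\alpha_1^1) & \cdots & B(\alpha_0^K) - B(\alpha_1^K) \\
\vdots & & \vdots & \vdots & & \vdots \\
\eta^T(\alpha_N^1) - \eta^T(\alpha_0^1) & \cdots & \eta^T(\alpha_N^K) - \eta^T(\alpha_0^K) & B(\alpha_0^1) - B(\alpha_N^1) & \cdots & B(\alpha_0^K) - B(\alpha_N^K)
\end{bmatrix}.
$$

Let $r_s = \mathrm{rank}(H_s)$ denote its rank, so $r_s \le \min\{N, MK + K\}$. Consider a rank factorization, $H_s =
L_s U_s$, where $L_s$ is an $N$-by-$r_s$ matrix of full column rank and $U_s$ is an $r_s$-by-$(MK + K)$ matrix of full
row rank.

Definition 4: Define the DSS as

$$
x_s^k = U_s \left( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \right)^T.
$$

Let $\Pi^k(Y^k, A^k; \theta)$ be the belief vector conditional on the observation sequence $Y^k$, the sampling-mode
sequence $A^k$, the prior belief $\theta$ and the time index $k$. For brevity, we use $\pi_i^k$ to denote the $(i + 1)$th
component of this belief vector. The following result suggests that this belief vector can be reconstructed
from the DSS, $x_s^k$, defined above.

Proposition 2: There is a mapping $T_s$ from $\mathbb{R}^{r_s}$ to the belief space, such that $\Pi^k(Y^k, A^k; \theta) = T_s(x_s^k; \theta)$. More
specifically, $T_s = (T_{s,0}, \ldots, T_{s,N})$, and

$$
\pi_0^k = T_{s,0}(x_s^k; \theta) = \left[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\left\{ D_i(x_s^k) \right\} + 1 \right]^{-1},
$$

$$
\pi_j^k = T_{s,j}(x_s^k; \theta) = \frac{\theta_j \exp\left\{ D_j(x_s^k) \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ D_i(x_s^k) \right\} + \theta_0}, \quad j = 1, \ldots, N,
$$

where $D_i(x_s^k) = e_i L_s x_s^k$.
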

Remark 4: When the number of sampling control actions is small compared with the number of
hypotheses, it might be beneficial to use the DSS-based approach. Further, if there are many control
actions but each action causes a systematic change in the observation distributions (see Example 4), then
the rank of the diagnostic matrix may still be low and the DSS approach is still preferred. When there is
only one sampling mode, i.e., $K = 1$, the DSS becomes $x_s^k = ((x^k)^T, k)^T$,
consisting of the $r$-dimensional sufficient statistic $x^k$ introduced earlier and the time index $k$ (which
accounts for the non-stationary acceptance regions). Thus, the classical problem discussed in Section III
is a special case of the sampling control problem by letting $K = 1$.


Example 4: adaptive sample size 

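For a random variable $X$, suppose that the population distribution of $X$ is normal $N(\mu, \sigma^2)$, with
known variance but unknown mean $\mu$. To test multiple simple hypotheses about the mean, namely $H_i :
N(\mu_i, \sigma^2)$, $i = 0, \ldots, N$, the decision maker can take multiple samples at once, or one by one, before
accepting a hypothesis. Generally speaking, taking multiple samples at once is not equivalent to taking
the same number of samples sequentially, because the latter allows one to stop at any time, before all
samples are observed. At the decision period $m$, she can choose the sample size $a_m \in \{1, 2, \ldots, K\}$ and
observe the sample average $\bar{X} = \frac{1}{a_m} \sum_{l=1}^{a_m} X_l$. Clearly $\bar{X} \sim N(\mu, \sigma^2/a_m)$. If the hypothesis $H_i$ is true,
then $\bar{X} \sim N(\mu_i, \sigma^2/a_m)$, which implies that $f_i^{a_m} = N(\mu_i, \sigma^2/a_m)$, $i = 0, \ldots, N$. Note that an action can
cause a global change in the variances of all hypotheses. For convenience, let $\mu_0 = 0$. The diagnostic
matrix becomes

$$
H_s =
\begin{bmatrix}
\dfrac{\mu_1}{\sigma^2}, & 0, & \dfrac{2\mu_1}{\sigma^2}, & 0, & \cdots & \dfrac{K\mu_1}{\sigma^2}, & 0, & -\dfrac{\mu_1^2}{2\sigma^2}, & -\dfrac{2\mu_1^2}{2\sigma^2}, & \cdots & -\dfrac{K\mu_1^2}{2\sigma^2} \\
\vdots & & & & & & & & & & \vdots \\
\dfrac{\mu_N}{\sigma^2}, & 0, & \dfrac{2\mu_N}{\sigma^2}, & 0, & \cdots & \dfrac{K\mu_N}{\sigma^2}, & 0, & -\dfrac{\mu_N^2}{2\sigma^2}, & -\dfrac{2\mu_N^2}{2\sigma^2}, & \cdots & -\dfrac{K\mu_N^2}{2\sigma^2}
\end{bmatrix},
$$

whose rank is two (unless the $\mu_i$'s are identical). A DSS is

$$
x_s^k = \left( \sum_{a=1}^{K} a \sum_{m \in \Omega_a} Y_m, \; \sum_{a=1}^{K} a k_a \right)^T,
$$

in which $\sum_{a=1}^{K} a k_a$ is the total number of samples taken up to the period $k$.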
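
The rank-two structure of Example 4 and the sufficiency of the two-dimensional DSS can be checked numerically. The sketch below is an illustration under assumptions: the all-zero columns of the diagnostic matrix (the second natural-parameter component, which is the same under every hypothesis for a given sample size) are omitted because they do not affect the rank, the log-partition columns are entered with a negative sign, and all names and numerical values are hypothetical.

```python
import numpy as np

def diagnostic_matrix_ex4(mu, sigma2, K):
    """N x 2K matrix: K natural-parameter columns, then K (negated) log-partition columns.
    Assumes mu[0] = 0 is the reference hypothesis; zero columns are omitted."""
    mu = np.asarray(mu, dtype=float)
    a = np.arange(1, K + 1, dtype=float)
    eta_cols = np.outer(mu[1:], a) / sigma2               # a * mu_i / sigma^2
    b_cols = -np.outer(mu[1:] ** 2, a) / (2.0 * sigma2)   # -a * mu_i^2 / (2 sigma^2)
    return np.hstack([eta_cols, b_cols])

def posterior_direct(actions, xbar, mu, sigma2, prior):
    """Bayes' rule on the full history of sample sizes and sample averages."""
    a = np.asarray(actions, dtype=float)[:, None]
    x = np.asarray(xbar, dtype=float)[:, None]
    loglik = (-a * (x - mu) ** 2 / (2.0 * sigma2)).sum(axis=0)
    w = np.log(prior) + loglik
    w = np.exp(w - w.max())
    return w / w.sum()

def posterior_from_dss(s1, s2, mu, sigma2, prior):
    """Posterior from s1 = sum_a a * sum_{m in Omega_a} Y_m and s2 = sum_a a * k_a."""
    d = mu * s1 / sigma2 - mu**2 * s2 / (2.0 * sigma2)
    w = np.log(prior) + d
    w = np.exp(w - w.max())
    return w / w.sum()

# Illustrative setting (assumption): four hypotheses, up to K = 3 samples per period
mu = np.array([0.0, 1.0, 2.0, 3.0])
sigma2, K = 4.0, 3
prior = np.full(4, 0.25)

rng = np.random.default_rng(2)
actions = rng.integers(1, K + 1, size=6)               # sample size chosen each period
xbar = rng.normal(mu[2], np.sqrt(sigma2 / actions))    # observed sample averages

Hs = diagnostic_matrix_ex4(mu, sigma2, K)
rank_s = np.linalg.matrix_rank(Hs)
p_full = posterior_direct(actions, xbar, mu, sigma2, prior)
p_dss = posterior_from_dss(np.sum(actions * xbar), np.sum(actions), mu, sigma2, prior)
print(rank_s, np.allclose(p_full, p_dss))
```

The state therefore stays two-dimensional however large $K$ is, because every sampling mode rescales the same two directions of the diagnostic matrix.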

VI. Concluding Remarks 

The generalization of Wald’s SPRT to multiple hypotheses has been widely discussed. The structure 
of the optimal policy is well understood but the optimal policy itself is difficult to implement in general. 
We find that, for exponential families, it is possible to devise an efficient solution method, which is 
scalable to a large number of simple hypotheses under flexible priors in most practical cases. The method 
reconstructs the belief vector using the so-called diagnostic sufficient statistic and reformulates the original
dynamic program in a low-dimensional space, whose dimensionality is determined by the rank of
the diagnostic matrix. The resulting control policy is distinct from the standard belief-vector approach
in the sense that the acceptance regions are non-stationary and prior-dependent. The optimal solution
is particularly desirable when the alternative hypotheses are difficult to differentiate and when a quick
decision has to be made.

We finally note that the diagnostic sufficient statistic is constructed using both the information from the
exponential family (i.e., opportunity 1) and the parameters associated with all hypotheses (i.e., opportunity
2). One needs to use the low-dimensional diagnostic sufficient statistic to reconstruct the high-dimensional
belief vector and reformulate the optimality equation. This across-dimension reconstruction technique is
the key to this solution.


Appendix A 

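Proof of Proposition 1

We first show that $\pi_0^k$ can be reconstructed from $x^k$. By the definition of $T_0^k$, we have

$$
\begin{aligned}
T_0^k(x^k; \theta) &= T_0^k\Big( U \sum_{m=1}^{k} t(Y_m); \theta \Big) = \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ D_i^k\Big( U \sum_{m=1}^{k} t(Y_m) \Big) \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ e_i L U \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_i) - B(\alpha_0) \right] \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ \left[ \eta^T(\alpha_i) - \eta^T(\alpha_0) \right] \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_i) - B(\alpha_0) \right] \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \exp\left[ \eta^T(\alpha_i) \sum_{m=1}^{k} t(Y_m) - k B(\alpha_i) \right]}{\theta_0 \exp\left[ \eta^T(\alpha_0) \sum_{m=1}^{k} t(Y_m) - k B(\alpha_0) \right]} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \prod_{m=1}^{k} \left( h(Y_m) \exp\left[ \eta^T(\alpha_i) t(Y_m) - B(\alpha_i) \right] \right)}{\theta_0 \prod_{m=1}^{k} \left( h(Y_m) \exp\left[ \eta^T(\alpha_0) t(Y_m) - B(\alpha_0) \right] \right)} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \prod_{m=1}^{k} f(Y_m; \alpha_i)}{\theta_0 \prod_{m=1}^{k} f(Y_m; \alpha_0)} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\pi_i^k}{\pi_0^k} + 1 \Big]^{-1} = \pi_0^k,
\end{aligned}
$$

where the third equality follows from the definition of $D_i^k$ and $H = LU$, the fourth equality follows
from the definition of $H$, the fifth and sixth equalities involve some algebraic manipulation and the
reconstruction of the exponential family, the seventh equality follows from the definition of the exponential
family and the i.i.d. assumption, the eighth equality follows from Bayes' rule, and the last equality
follows because $\sum_{i=0}^{N} \pi_i^k = 1$. Using the same technique, other components can also be reconstructed
from $x^k$. For $j = 1, \ldots, N$, we have

$$
\begin{aligned}
T_j^k(x^k; \theta) &= T_j^k\Big( U \sum_{m=1}^{k} t(Y_m); \theta \Big) = \frac{\theta_j \exp\left\{ D_j^k\left( U \sum_{m=1}^{k} t(Y_m) \right) \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ D_i^k\left( U \sum_{m=1}^{k} t(Y_m) \right) \right\} + \theta_0} \\
&= \frac{\theta_j \exp\left\{ e_j L U \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_j) - B(\alpha_0) \right] \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ e_i L U \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_i) - B(\alpha_0) \right] \right\} + \theta_0} \\
&= \frac{\theta_j \exp\left\{ \left[ \eta^T(\alpha_j) - \eta^T(\alpha_0) \right] \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_j) - B(\alpha_0) \right] \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ \left[ \eta^T(\alpha_i) - \eta^T(\alpha_0) \right] \sum_{m=1}^{k} t(Y_m) - k \left[ B(\alpha_i) - B(\alpha_0) \right] \right\} + \theta_0} \\
&= \frac{\dfrac{\theta_j \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_j) t(Y_m) - k B(\alpha_j) \right\}}{\theta_0 \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_0) t(Y_m) - k B(\alpha_0) \right\}}}{\sum_{i=1}^{N} \dfrac{\theta_i \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_i) t(Y_m) - k B(\alpha_i) \right\}}{\theta_0 \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_0) t(Y_m) - k B(\alpha_0) \right\}} + 1} \\
&= \frac{\dfrac{\theta_j \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_j) t(Y_m) - B(\alpha_j) \right\}}{\theta_0 \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_0) t(Y_m) - B(\alpha_0) \right\}}}{\sum_{i=1}^{N} \dfrac{\theta_i \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_i) t(Y_m) - B(\alpha_i) \right\}}{\theta_0 \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_0) t(Y_m) - B(\alpha_0) \right\}} + 1} \\
&= \frac{\dfrac{\theta_j \prod_{m=1}^{k} f_j(Y_m)}{\theta_0 \prod_{m=1}^{k} f_0(Y_m)}}{\sum_{i=1}^{N} \dfrac{\theta_i \prod_{m=1}^{k} f_i(Y_m)}{\theta_0 \prod_{m=1}^{k} f_0(Y_m)} + 1}
= \frac{\pi_j^k / \pi_0^k}{\sum_{i=1}^{N} \pi_i^k / \pi_0^k + 1} = \pi_j^k.
\end{aligned}
$$

By now we have shown that $T^k(x^k; \theta) = \left( T_0^k(x^k; \theta), \ldots, T_N^k(x^k; \theta) \right) = \left( \pi_0^k, \ldots, \pi_N^k \right)$,
thereby completing the proof.
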

Appendix B 

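Proof of Proposition 2

This proof is an extension of the proof of Proposition 1. Although there is some redundancy, we still
show the full derivation for completeness. We first consider $\pi_0^k$. By the definition of $T_{s,0}$, we have

$$
\begin{aligned}
T_{s,0}(x_s^k; \theta) &= T_{s,0}\Big( U_s \Big( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \Big)^T; \theta \Big) \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ e_i L_s U_s \Big( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \Big)^T \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ e_i H_s \Big( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \Big)^T \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ \sum_{a=1}^{K} \Big( \left[ \eta^T(\alpha_i^a) - \eta^T(\alpha_0^a) \right] \sum_{m \in \Omega_a} t(Y_m) - k_a \left[ B(\alpha_i^a) - B(\alpha_0^a) \right] \Big) \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i}{\theta_0} \exp\Big\{ \sum_{m=1}^{k} \left[ \eta^T(\alpha_i^{a_m}) - \eta^T(\alpha_0^{a_m}) \right] t(Y_m) - \sum_{m=1}^{k} \left[ B(\alpha_i^{a_m}) - B(\alpha_0^{a_m}) \right] \Big\} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \exp\left[ \sum_{m=1}^{k} \eta^T(\alpha_i^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_i^{a_m}) \right]}{\theta_0 \exp\left[ \sum_{m=1}^{k} \eta^T(\alpha_0^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_0^{a_m}) \right]} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \prod_{m=1}^{k} \left( h(Y_m) \exp\left[ \eta^T(\alpha_i^{a_m}) t(Y_m) - B(\alpha_i^{a_m}) \right] \right)}{\theta_0 \prod_{m=1}^{k} \left( h(Y_m) \exp\left[ \eta^T(\alpha_0^{a_m}) t(Y_m) - B(\alpha_0^{a_m}) \right] \right)} + 1 \Big]^{-1} \\
&= \Big[ \sum_{i=1}^{N} \frac{\theta_i \prod_{m=1}^{k} f(Y_m; \alpha_i^{a_m})}{\theta_0 \prod_{m=1}^{k} f(Y_m; \alpha_0^{a_m})} + 1 \Big]^{-1}
= \Big[ \sum_{i=1}^{N} \frac{\pi_i^k}{\pi_0^k} + 1 \Big]^{-1} = \pi_0^k.
\end{aligned}
$$

In the fifth equality, we switch from the summation over the sampling modes to the summation over time.
Applying similar arguments to the other components yields, for $j = 1, \ldots, N$,

$$
\begin{aligned}
T_{s,j}(x_s^k; \theta) &= T_{s,j}\Big( U_s \Big( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \Big)^T; \theta \Big) \\
&= \frac{\theta_j \exp\left\{ e_j H_s \left( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \right)^T \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ e_i H_s \left( \sum_{m \in \Omega_1} t^T(Y_m), \ldots, \sum_{m \in \Omega_K} t^T(Y_m), k_1, \ldots, k_K \right)^T \right\} + \theta_0} \\
&= \frac{\theta_j \exp\left\{ \sum_{a=1}^{K} \left( \left[ \eta^T(\alpha_j^a) - \eta^T(\alpha_0^a) \right] \sum_{m \in \Omega_a} t(Y_m) - k_a \left[ B(\alpha_j^a) - B(\alpha_0^a) \right] \right) \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ \sum_{a=1}^{K} \left( \left[ \eta^T(\alpha_i^a) - \eta^T(\alpha_0^a) \right] \sum_{m \in \Omega_a} t(Y_m) - k_a \left[ B(\alpha_i^a) - B(\alpha_0^a) \right] \right) \right\} + \theta_0} \\
&= \frac{\theta_j \exp\left\{ \sum_{m=1}^{k} \left[ \eta^T(\alpha_j^{a_m}) - \eta^T(\alpha_0^{a_m}) \right] t(Y_m) - \sum_{m=1}^{k} \left[ B(\alpha_j^{a_m}) - B(\alpha_0^{a_m}) \right] \right\}}{\sum_{i=1}^{N} \theta_i \exp\left\{ \sum_{m=1}^{k} \left[ \eta^T(\alpha_i^{a_m}) - \eta^T(\alpha_0^{a_m}) \right] t(Y_m) - \sum_{m=1}^{k} \left[ B(\alpha_i^{a_m}) - B(\alpha_0^{a_m}) \right] \right\} + \theta_0} \\
&= \frac{\dfrac{\theta_j \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_j^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_j^{a_m}) \right\}}{\theta_0 \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_0^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_0^{a_m}) \right\}}}{\sum_{i=1}^{N} \dfrac{\theta_i \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_i^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_i^{a_m}) \right\}}{\theta_0 \exp\left\{ \sum_{m=1}^{k} \eta^T(\alpha_0^{a_m}) t(Y_m) - \sum_{m=1}^{k} B(\alpha_0^{a_m}) \right\}} + 1} \\
&= \frac{\dfrac{\theta_j \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_j^{a_m}) t(Y_m) - B(\alpha_j^{a_m}) \right\}}{\theta_0 \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_0^{a_m}) t(Y_m) - B(\alpha_0^{a_m}) \right\}}}{\sum_{i=1}^{N} \dfrac{\theta_i \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_i^{a_m}) t(Y_m) - B(\alpha_i^{a_m}) \right\}}{\theta_0 \prod_{m=1}^{k} h(Y_m) \exp\left\{ \eta^T(\alpha_0^{a_m}) t(Y_m) - B(\alpha_0^{a_m}) \right\}} + 1} \\
&= \frac{\dfrac{\theta_j \prod_{m=1}^{k} f_j^{a_m}(Y_m)}{\theta_0 \prod_{m=1}^{k} f_0^{a_m}(Y_m)}}{\sum_{i=1}^{N} \dfrac{\theta_i \prod_{m=1}^{k} f_i^{a_m}(Y_m)}{\theta_0 \prod_{m=1}^{k} f_0^{a_m}(Y_m)} + 1}
= \frac{\pi_j^k / \pi_0^k}{\sum_{i=1}^{N} \pi_i^k / \pi_0^k + 1} = \pi_j^k,
\end{aligned}
$$

thereby completing the proof.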